Scholarship Project
  • Introduction
  • Data Cleaning
  • EDA
  • Analysis
    • Overview
    • RCP8.5
    • RCP4.5
  • Furture Actions

On this page

  • Introduction
  • Feature Selection
  • Basic Analysis 4.5 vs 8.5
    • Scatterplot
      • Useful Variables
    • Pearson Correlation
      • Methodology
      • RCP4.5
      • RCP8.5
    • RMSE
  • Conclusion

Dataset Analysis Overview

Author

JaeHo Bahng

Published

May 19, 2024


Introduction

Before we dive into scenarios, lets conduct a simple comparison on the overall trend of RCP 4.5 and 8.5. What variables influence annual temperature the most when grouped into two RCP scenarios as a whole?

Methodology

We will be using scatterplots and pearson correlation and RMSE to find similarities and differences between the two RCP scenarios.

Import module / Set options and theme
import pandas as pd
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import xml.etree.ElementTree as ET
import plotly.express as px
import plotly.graph_objects as go
from scipy.stats import ttest_rel
from statsmodels.stats.weightstats import ttest_ind
import pingouin as pg
from scipy.stats import zscore
from plotly.subplots import make_subplots
import warnings
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

warnings.filterwarnings("ignore")

pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 10)

Feature Selection

When trying to find correlations between Annual Temperature, the temperature related features such as Tmin, Tmax, T_Seasonal always turned out to have a high correlation with annual temperature and therefore acted as a hinderance for proper analysis. I removed all temperature related variables to clearly investigate what properties affect the temperature.

Import cleaned data
df = pd.read_csv('../data/cleaned_df.csv')
df['Location_ID'] = df.groupby(['long', 'lat']).ngroup() + 1

group_list = ['Park', 'long', 'lat', 'veg', 'year', 'TimePeriod', 'RCP','treecanopy', 'Ann_Herb', 'Bare', 'Herb', 'Litter', 'Shrub', 'El', 'Sa','Cl', 'RF', 'Slope', 'E', 'S']
veg_location = df.drop(labels='scenario',axis=1).groupby(group_list).mean().reset_index()

numeric_series = pd.to_numeric(veg_location['RCP'], errors='coerce')

veg_location['RCP'] = numeric_series.fillna(veg_location['RCP'])

four = veg_location[veg_location['RCP'].isin([4.5])]
eight = veg_location[veg_location['RCP'].isin([8.5])]
four_h = veg_location[veg_location['RCP'].isin(['historical'])]
four_h['RCP'] = 4.5
eight_h = veg_location[veg_location['RCP'].isin(['historical'])]
eight_h['RCP'] = 8.5

df_con = pd.concat([four_h, four, eight_h, eight], ignore_index=True)
df_con['Location_ID'] = df_con.groupby(['long', 'lat']).ngroup() + 1





numeric_series = pd.to_numeric(df['RCP'], errors='coerce')

numeric_series


df['RCP'] = numeric_series.fillna(df['RCP'])

four = df[df['RCP'].isin([4.5])]
eight = df[df['RCP'].isin([8.5])]
four_h = df[df['RCP'].isin(['historical'])]
four_h['RCP'] = 4.5
eight_h = df[df['RCP'].isin(['historical'])]
eight_h['RCP'] = 8.5

df_orig = pd.concat([four_h, four, eight_h, eight], ignore_index=True)
df_orig['Location_ID'] = df_orig.groupby(['long', 'lat']).ngroup() + 1

selected_columns = [col for col in df.columns if not col.startswith(('T_', 'Tmin', 'Tmax'))]
dropped_columns = [col for col in df.columns if col.startswith(('T_', 'Tmin', 'Tmax'))]
filtered_df = df_orig[selected_columns]
filtered_df['T_Annual'] = df_orig['T_Annual']

df_orig = filtered_df

print("Dropped Columns : ", dropped_columns)
Dropped Columns :  ['T_P_Corr', 'T_Winter', 'T_Spring', 'T_Summer', 'T_Fall', 'Tmax_Winter', 'Tmax_Spring', 'Tmax_Summer', 'Tmax_Fall', 'Tmin_Winter', 'Tmin_Spring', 'Tmin_Summer', 'Tmin_Fall', 'T_Annual']

Basic Analysis 4.5 vs 8.5

Scatterplot

With a basic scatterplot, we can see basic correlations of how each numerical variable correlates to either the annual temperature or the annual percipitation. Since RCP 8.5 and RCP 4.5 have different predictions, two plots were used for each scenario.

Firstly, without an additional feature, we can see that the more percipitation, the lower the annual temperature because we can easily draw a line with a negative slope through the scaterred plots.

4.5 vs 8.5 scatterplot
# Assuming df_con is your DataFrame and is already loaded
# List of columns to use for coloring
test = df_con.iloc[:,list(range(1, 3))+ [4,6] + list(range(8, len(df_con.columns)-1))]
color_columns = list(test.columns)
rcp_values = test['RCP'].unique()

subplot_titles = [f'RCP {rcp}' for rcp in rcp_values]

fig = make_subplots(rows=1, cols=len(rcp_values), shared_yaxes=True, subplot_titles=subplot_titles, horizontal_spacing=0.15)

for i, col in enumerate(color_columns):
    for j, rcp in enumerate(rcp_values):
        fig.add_trace(
            go.Scatter(
                x=test[(test['year'].isin(range(2060, 2100))) & (test['RCP'] == rcp)]['PPT_Annual'],
                y=test[(test['year'].isin(range(2060, 2100))) & (test['RCP'] == rcp)]['T_Annual'],
                mode='markers',
                marker=dict(
                    color=test[(test['year'].isin(range(2060, 2100))) & (test['RCP'] == rcp)][col],
                    colorbar=dict(
                        # title='Scale',
                                  tickmode='array',
                                  tickvals=[round(i,2) for i in np.linspace(start=round(min(test[(test['year'].isin(range(2060, 2100)) & (test['RCP'] == rcp))][col]),2),stop=round(max(test[(test['year'].isin(range(2060, 2100)) & (test['RCP'] == rcp))][col]),2),num=5)],
                                  ticktext=[round(i,2) for i in np.linspace(start=round(min(test[(test['year'].isin(range(2060, 2100)) & (test['RCP'] == rcp))][col]),2),stop=round(max(test[(test['year'].isin(range(2060, 2100)) & (test['RCP'] == rcp))][col]),2),num=5)],
                                  y=0.5,
                                  x= 0.43 + (j*0.58)
                                  ),
                                  colorscale='rdpu'
                ),
                name=col,
                visible=True if i == 0 else False,
                hovertemplate=(
                    f"<b>{col}</b><br>"
                    "Precipitation: %{x}<br>"
                    "Temperature: %{y}<br>"
                    "RCP: " + str(rcp) + "<br>"
                    "Value: %{marker.color}<br>"
                    "<extra></extra>"
                  ) 
            ),
            row=1, col=j+1
        )


fig.update_layout(
    title={
        'text': '<b>Annual Precipitation vs Temperature by RCP Scenarios</b>',
        'x': 0.5,
        'y': 0.97,
        'xanchor': 'center'
    },
    showlegend=False
)


dropdown_buttons = [
    {
        'label': col,
        'method': 'update',
        'args': [
            {
                'visible': [col == color_column for color_column in color_columns for _ in rcp_values]
            },
            {
                'title': {'text': f'<b>Annual Precipitation vs Temperature by {col}</b>', 'x':0.5, 'y':0.97},
                'marker': {'colorbar': {'title': 'Scale'}}
            }
        ]
    }
    for col in color_columns
]

fig.update_layout(
    updatemenus=[
        {
            'buttons': dropdown_buttons,
            'direction': 'down',
            'showactive': True,
            'x': 0.5,
            'xanchor': 'center',
            'y': 1.19,
            'yanchor': 'top'
        }
    ]
)

fig.update_xaxes(title_text="Annual Precipitation", row=1, col=1)
fig.update_yaxes(title_text="Annual Temperature", row=1, col=1)
fig.update_xaxes(title_text="Annual Precipitation", row=1, col=2)

for annotation in fig['layout']['annotations']:
    annotation['font'] = {'size': 12, 'color': 'black'}

fig.show()

Useful Variables

By trying each numerical variable as a color metric for the scatter plots, four important features that I found are as below:

  1. VWC_Summer_whole
  2. WetSoilDays_Spring_whole
  3. SWA_Fall_whole
  4. Bare

For now, rather than focusing on the seasonality of the variables, lets focus on VWC, WetSoilDays, SWA and Bare. We can infer that it is important to keep the land moist to a certain level and minimize the ‘bareness’ in the area to lower the temperature. Lets move on to each RCP scenario for a more detailed analysis.

Important Features
feature = 'VWC_Summer_whole'
fig = px.scatter(df_con[df_con['year'].isin(range(2060,2099))], x="PPT_Annual", y="T_Annual",
                 color=feature, facet_col="RCP",     
                 labels={
        feature: 'Scale'  
    })
fig.update_layout(
    title={
        'text': f'<b>Annual Precipitation/Temperature ({feature})</b>',
        'x':0.5,  
        'xanchor': 'center'
    },
    title_font=dict(size=20)  
)
fig.show()


feature = 'WetSoilDays_Spring_whole'
fig = px.scatter(df_con[df_con['year'].isin(range(2060,2099))], x="PPT_Annual", y="T_Annual",
                 color=feature, facet_col="RCP",     
                 labels={
        feature: 'Scale' 
    })
fig.update_layout(
    title={
        'text': f'<b>Annual Precipitation/Temperature ({feature})</b>',
        'x':0.5,  
        'xanchor': 'center' 
    },
    title_font=dict(size=20)
)

fig.show()

feature = 'SWA_Fall_whole'
fig = px.scatter(df_con[df_con['year'].isin(range(2060,2099))], x="PPT_Annual", y="T_Annual",
                 color=feature, facet_col="RCP",     
                 labels={
        feature: 'Scale' 
    })
fig.update_layout(
    title={
        'text': f'<b>Annual Precipitation/Temperature ({feature})</b>',
        'x':0.5, 
        'xanchor': 'center'  
    },
    title_font=dict(size=20)
)

fig.show()

feature = 'Bare'
fig = px.scatter(df_con[df_con['year'].isin(range(2060,2099))], x="PPT_Annual", y="T_Annual",
                 color=feature, facet_col="RCP",     
                 labels={
        feature: 'Scale' 
    })
fig.update_layout(
    title={
        'text': f'<b>Annual Precipitation/Temperature ({feature})</b>',
        'x':0.5, 
        'xanchor': 'center',  
    },
    title_font=dict(size=20)
)

fig.show()

Pearson Correlation

What is Pearson Correlation?
Pearson correlation is a measure of the linear relationship between two continuous variables. It quantifies the degree to which a change in one variable predicts a change in another variable, with values ranging from -1 to +1. A value of +1 indicates a perfect positive linear relationship, where an increase in one variable corresponds to a proportional increase in the other. A value of -1 indicates a perfect negative linear relationship, where an increase in one variable corresponds to a proportional decrease in the other. A value of 0 indicates that there is no linear relationship between the variables.

Methodology

We will be calculating correlations on predicted years only and not the observed data marked as ‘hist’ because as we saw in the EDA section, the predictions rise after predicitons and have a rather fluctuating pattern in the past. Since our goal is to find features that correlate with annual temperature, removing history data would create more correlation within the dataset.

RCP4.5

By calculating a Pearson correlation matrix for all numerical variables in the dataset, we can create a heatmap to explore the relationships between different features. This visualization helps identify which features affect each other and highlights those that correlate most strongly with the annual temperature.

In the second plot below which is a plot for features sorted for most correlated with T_Annual, we see that the highest correlators are:

  • Extreme Short Term dry stress
  • Frost days

While these factors cannot be directly influenced, some takeaways from the plot would be that we can take action on the following factors which were the next highly ranked groups to mitigate temperature rise:

  • PET (Potential Evapotranspiration)
  • SWA (Soil Water Availability)
  • VWC (Volumetric Water Content)
Correlation Heatmap
corr_matrix = df_orig[(df_orig['TimePeriod']!='Hist') & (df_orig['RCP']==4.5)].iloc[:,8:].corr()

fig = px.imshow(corr_matrix,
                # text_auto=True,
                labels=dict(color="Correlation"),
                x=corr_matrix.columns,
                y=corr_matrix.columns,
                color_continuous_scale='RdBu_r'
  )

fig.update_layout(
    width=800, 
    height=900,  
    margin=dict(l=10, r=1, t=50, b=10)  
)


fig.update_layout(title_text='<b>Future Correlation Heatmap : RCP 4.5</b>', 
                  title_x=0.5)

fig.update_xaxes(tickfont=dict(size=10))  
fig.update_yaxes(tickfont=dict(size=10))

fig.show()
Correlation Feature Importance
sorted_loadings = corr_matrix['T_Annual'].abs().sort_values(ascending=False)

top_features = sorted_loadings.head(20).index

top_loadings = round(corr_matrix['T_Annual'].loc[top_features,],4)[1:]

colors = ['blue' if val > 0 else 'red' for val in top_loadings]


fig = go.Figure(data=[go.Bar(
    x=top_loadings.index,
    y=top_loadings.abs(),
    text=top_loadings.values,
    textposition='inside',
    marker_color=colors,
    showlegend=False
)])

fig.add_trace(go.Bar(
    x=[None], y=[None],
    marker=dict(color='blue'),
    showlegend=True,
    name='Positive'
))
fig.add_trace(go.Bar(
    x=[None], y=[None],
    marker=dict(color='red'),
    showlegend=True,
    name='Negative'
))

fig.update_layout(
    title='<b>Top Features Correlating to Annual Temperature (RCP = 4.5)</b>',
    xaxis_title='Features',
    yaxis_title='Absolute PCA1 Loadings',
    yaxis=dict(tickformat=".2f"),
    xaxis=dict(
        tickangle=45, 
        tickfont=dict(size=10) 
    ),
    template='plotly_white',
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1
    ),
    margin=dict(l=20, r=20, b=20, t=60)
)

fig.show()

RCP8.5

In the analysis of Pearson correlation for different RCP (Representative Concentration Pathway) scenarios, it was observed that the ranking of features changed between the RCP 4.5 and RCP 8.5 scenarios. Here’s a detailed comparison:

Similarities:

  1. Higher ranked features are extreme short term dry stress and Frost Days
  2. Lower ranked features are VWC, SWA and PET

Differences:

  1. Correlation gap between higher ranked and lower Ranked Features

    • RCP 4.5 : The transition from higher to lower ranked features seem smooth on the plot and highest ranked VWC feature has a correlation of -0.37.
    • RCP 8.5 : There is an abrupt change when shifting to lower ranked features and the first VWC feature has a correlation of -0.26 which is significantly lower than the RCP 4.5 scenario.
  2. RCP 8.5 Scenario:

    • Correlation decrease in VWC and SWA Features: The number of VWC and SWA features significantly decreased, and their values decreased significantly compared to RCP4.5 correlations. Furthermore, there is a visual drop between PET_Spring and Evap_Spring whereas there was a smooth transition for RCP4.5.

Implications:

  1. RCP 4.5: This scenario suggests room for human actions to influence annual temperature.
  2. RCP 8.5: Under this more extreme scenario, unchangable variables become more dominant, overshadowing the influence of water-related variables, therefore implying that it is harder to influence temperature.

These differences highlight how the impact of climate variables can shift under different greenhouse gas concentration pathways, with temperature effects becoming more pronounced under higher emission scenarios.

What does this mean?
With the knowledge that 8.5 means higher CO2 emission as mentioned in the EDA tab, we can imply that if the RCP8.5 scenario takes place, there is less room for people to prevent temperatures from rising and preserving the environment. This is a logical statement even without the data but the data reinforces the idea by showing less variables in the pearson correlation with the annual temperature.

Correlation Heatmap
corr_matrix = df_orig[(df_orig['TimePeriod']!='Hist') & (df_orig['RCP']==8.5)].iloc[:,8:].corr()


fig = px.imshow(corr_matrix,
                labels=dict(color="Correlation"),
                x=corr_matrix.columns,
                y=corr_matrix.columns,
                color_continuous_scale='RdBu_r'
  )  

fig.update_layout(
    width=800, 
    height=900, 
    margin=dict(l=10, r=1, t=50, b=10)  
)


fig.update_layout(title_text='<b>Future Correlation Heatmap : RCP 8.5</b>',
                  title_x=0.5)  

fig.update_xaxes(tickfont=dict(size=10))
fig.update_yaxes(tickfont=dict(size=10)) 

fig.show()
Correlation Feature Importance
sorted_loadings = corr_matrix['T_Annual'].abs().sort_values(ascending=False)

top_features = sorted_loadings.head(20).index

top_loadings = round(corr_matrix['T_Annual'].loc[top_features,],4)[1:]

colors = ['blue' if val > 0 else 'red' for val in top_loadings]

fig = go.Figure(data=[go.Bar(
    x=top_loadings.index,
    y=top_loadings.abs(),
    text=top_loadings.values,
    textposition='inside',
    marker_color=colors,
    showlegend=False
)])

fig.add_trace(go.Bar(
    x=[None], y=[None],
    marker=dict(color='blue'),
    showlegend=True,
    name='Positive'
))
fig.add_trace(go.Bar(
    x=[None], y=[None],
    marker=dict(color='red'),
    showlegend=True,
    name='Negative'
))

fig.update_layout(
    title='<b>Top Features Correlating to Annual Temperature(RCP = 8.5)</b>',
    xaxis_title='Features',
    yaxis_title='Absolute PCA1 Loadings',
    yaxis=dict(tickformat=".2f"),
    xaxis=dict(
        tickangle=45,  
        tickfont=dict(size=10)  
    ),
    template='plotly_white',
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1
    ),
    margin=dict(l=20, r=20, b=20, t=60)
)

fig.show()

RMSE

To identify the most significant differences between two scenarios, I employed another method: calculating the RMSE (Root Mean Square Error) for each column. Given that both scenarios have the same number of data points, this approach provided an effective way to quantify the differences. Before performing the RMSE calculations, I standardized the data points to ensure consistency.

When analyzing differences using RMSE, precipitation-related features stood out as the most variable between the two RCP scenarios. Interestingly, it wasn’t always the scenario with the highest precipitation that had the lower temperature. A surprising difference was in PPT_Summer where RCP8.5 had higher average percipitation than RCP4.5. T

What does this mean?
This may be a sign that percipitation itself may not be an important factor when predicting the temperature, but may be used as a ratio to some other feature to predict.

RMSE
df_45 = df_orig[df_orig['RCP'] == 4.5]
df_85 = df_orig[df_orig['RCP'] == 8.5]

df1 = df_45.iloc[:,8:-3]
df2 = df_85.iloc[:,8:-3]

def standardize(df):
    return df.apply(zscore)

z_df1 = standardize(df1)
z_df2 = standardize(df2)

abs_diff_z_scores = np.abs(z_df1 - z_df2)

mean_abs_diff = abs_diff_z_scores.mean()

rmse = np.sqrt(np.mean((z_df1.reset_index(drop=True) - z_df2.reset_index(drop=True))**2, axis=0))

rmse_sort = rmse.sort_values(ascending=False).head(20)

fig = go.Figure(data=[go.Bar(
    x=rmse_sort.keys(),
    y=rmse_sort.values,
    text=[round(i,4) for i in rmse_sort.values], 
    textposition='inside',
    showlegend=False
)])


fig.update_layout(
    title='<b>RCP 4.5 vs 8.5 RMSE of components</b>',
    xaxis_title='Features',
    yaxis_title='Absolute PCA1 Loadings',
    yaxis=dict(tickformat=".2f"),
    template='plotly_white',
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1
    ),
    margin=dict(l=20, r=20, b=20, t=60)
)

fig.show()

for i, feature in enumerate(rmse_sort.keys()[:5]):
    print("Rank : ", i+1)
    print(f"RCP 4.5 Average {feature} : ", df_45[feature].mean())
    print(f"RCP 8.5 Average {feature} : ", df_85[feature].mean())
    print("")
Rank :  1
RCP 4.5 Average FrostDays_Summer :  0.00022957468556467173
RCP 8.5 Average FrostDays_Summer :  0.00011478734278233587

Rank :  2
RCP 4.5 Average PPT_Fall :  9.65464850475821
RCP 8.5 Average PPT_Fall :  9.665259282743092

Rank :  3
RCP 4.5 Average PPT_Spring :  6.52676644055382
RCP 8.5 Average PPT_Spring :  6.369110562075354

Rank :  4
RCP 4.5 Average Evap_Spring :  4.403702479181158
RCP 8.5 Average Evap_Spring :  4.269294863702628

Rank :  5
RCP 4.5 Average PPT_Summer :  8.145404406740747
RCP 8.5 Average PPT_Summer :  8.249074486053336

Conclusion

Scatterplot
With the scatterplot, we were able to visually identify features that have strong positive and negative correlations with annual temperature. A few variables that we found were VWC_Summer_whole, WetSoilDays_Spring_whole, SWA_Fall_whole, and Bare

Correlation Heatmap
Firstly, the correlation heatmaps give us a direct insight on which features directly correlate with annual temperature. Extreme Short Term Stress, Frost Days, PET, VWC were overlapping features for both RCP scenarios. However, the difference was that the RCP8.5 correlations had a sudden drop for the VWC variables allowing us to hypothesize that human actions will affect the temperature less when done in a scenario where more CO2 emission is happenning.

RMSE
The RMSE gives us a different perspective to features that correlate with annual temperature. For example, by calculating RMSE for each column against each other for different RCP scenarios, we know that percipitation is very different for the two scenarios. However, the higher temperature scenario also has higher percipitation which is contradicting to common logic. This may be a sign that percipitation itself may not be an important factor when predicting the temperature, but may be used as a ratio to some other feature to predict.